Abstract

Many exome sequencing studies of Mendelian disorders fail to optimally exploit family information. Classical 
genetic linkage analysis is an effective method for eliminating a large fraction of the candidate causal variants 
discovered, even in small families that lack a unique linkage peak. We demonstrate that accurate genetic linkage 
mapping can be performed using SNP genotypes extracted from exome data, removing the need for separate 
array-based genotyping. We provide software to facilitate such analyses. 



Background 

Whole exome sequencing (WES) has recently become a 
popular strategy for discovering potential causal variants 
in individuals with inherited Mendelian disorders, providing
a cost-effective, fast-track approach to variant
discovery. However, a typical human genome differs 
from the reference genome at over 10,000 potentially 
functional sites [1]; identifying the disease-causing muta- 
tion among this plethora of variants can be a significant 
challenge. For this reason, exome sequencing is often 
preceded by genetic linkage analysis, which allows var- 
iants outside of linkage peaks to be excluded. The link- 
age peaks delineate tracts of identity by descent sharing 
that match the proposed genetic model. This combina- 
tion strategy has been successfully used to identify var- 
iants causing autosomal dominant [2-4] and recessive 
[5-11] diseases, as well as those affecting quantitative 
traits [12-14]. Linkage analysis has also been used in 
conjunction with whole genome sequencing (WGS) [15]. 

Other WES studies have not performed formal linkage 
analysis, but have nonetheless considered inheritance 
information, such as searching for large regions of 
homozygosity shared by affected family members using 
genotypes obtained from genotyping arrays [16-18] or
exome data [19,20]. This method does not incorporate 
genetic map or allele frequency information, which 
could help to eliminate regions from consideration, and 
is applicable only to recessive diseases resulting from 
consanguinity. Recently, it has been suggested that iden- 
tity by descent regions be identified from exome data 
using a non-homogeneous hidden Markov model 
(HMM), allowing variants outside these regions to be 
eliminated [21,22]. This method incorporates genetic 
map information but not allele frequency information 
and requires a strict genetic model (recessive and fully 
penetrant) and sampling scheme (exomes of two or 
more affected siblings must be sequenced). It would be 
suboptimal for use with diseases resulting from consan- 
guinity, for which filtering by homozygosity by descent 
would be more effective than filtering by identity by des- 
cent. Finally, several WES studies have been published 
that make no use of inheritance information whatsoever, 
despite the fact that DNA from other informative family 
members was available [23-31]. 

Classical linkage analysis using the multipoint Lander-Green
algorithm [32], which is an HMM, incorporates
genetic map and allele frequency information and allows 
for great flexibility in the disease model. Unlike the 
methods just mentioned, linkage analysis allows domi- 
nant, recessive or X-linked inheritance models, as well 
as permitting variable penetrances, non-parametric analysis
and formal haplotype inference. There are few constraints
upon the sampling design, with unaffected
individuals able to contribute information to parametric 
linkage analyses. The Lander-Green algorithm has pro- 
duced many important linkage results, which have facili- 
tated the identification of the underlying disease-causing 
mutations. 

We investigated whether linkage analysis using the 
Lander-Green algorithm could be performed using gen- 
otypes inferred from WES data, removing the need for 
the array-based genotyping step [33]. We inferred genotypes
at the location of HapMap Phase II SNPs [34], as
this resource provides comprehensive annotation, 
including the population allele frequencies and genetic 
map positions required for linkage analysis. We adapted 
our existing software [35] to extract HapMap Phase II 
SNP genotypes from WES data and format them for 
linkage analysis. 

We anticipated two potential disadvantages to this 
approach. Firstly, exome capture only targets exonic 
SNPs, resulting in gaps in marker coverage outside of 
exons. Secondly, genotypes obtained using massively 
parallel sequencing (MPS) technologies such as WES 
tend to have a higher error rate than those obtained 
from genotyping arrays [36]. The use of erroneous geno- 
types in linkage analyses may reduce power to detect 
linkage peaks or result in false positive linkage peaks 
[37]. 

We compared the results of linkage analysis using 
array-based and exome genotypes for three families with 
different neurological disorders showing Mendelian 
inheritance (Figure 1). We sequenced the exomes of two 
affected siblings from family M, an Anglo-Saxon ances- 
try family showing autosomal dominant inheritance. 



The exome of a single affected individual, the offspring 
of first cousins, from Iranian family A was sequenced, as 
was the exome of a single affected individual, the off- 
spring of parents thought to be first cousins once 
removed, from the Pakistani family T. Families A and T 
showed recessive inheritance. Due to the consanguinity 
present in these families, we can perform linkage analy- 
sis using genotypes from a single affected individual, a 
method known as homozygosity mapping [33]. 
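As a rough guide to how much linkage information a single inbred affected individual can contribute, the sketch below computes an approximate maximum LOD score for homozygosity mapping. It assumes a rare, fully penetrant recessive disease allele and fully informative markers, in which case the maximum LOD is approximately log10(1/F), where F is the inbreeding coefficient. The function and dictionary names are hypothetical and are not part of the software described here.

```python
import math

# Standard inbreeding coefficients (F) for the offspring of the two
# parental relationships described in the text.
INBREEDING = {
    "first cousins": 1 / 16,
    "first cousins once removed": 1 / 32,
}


def approx_max_lod(f: float) -> float:
    """Approximate maximum LOD from one affected inbred individual.

    Assumption: rare, fully penetrant recessive allele and fully
    informative markers, so the maximum LOD is about log10(1/F).
    """
    if not 0 < f <= 1:
        raise ValueError("F must be in (0, 1]")
    return math.log10(1 / f)


for relationship, f in INBREEDING.items():
    print(f"{relationship}: F = {f}, approximate max LOD = {approx_max_lod(f):.2f}")
```

Under these assumptions the approximate values are in line with the maximum LOD scores of 1.2 and 1.51 listed for families A and T in Table 6.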

Results and discussion 

Exome sequencing coverage of HapMap Phase II SNPs 

Allele frequencies and genetic map positions were avail- 
able for 3,269,163 HapMap Phase II SNPs that could be 
translated to UCSC hg19 physical coordinates. The Illumina
TruSeq platform used for exome capture targeted
61,647 of these SNPs (1.89%). After discarding indels 
and SNPs whose alleles did not match the HapMap 
annotations, a median 56,931 (92.3%) of targeted SNPs 
were covered by at least five high-quality reads (Table 
1). A median of 64,065 untargeted HapMap Phase II 
SNPs were covered by at least five reads; a median 78% 
of these untargeted SNPs were found to lie within 200 
bp of a targeted feature, comprising a median 57% of all 
untargeted HapMap SNPs within 200 bp of a targeted 
feature. 
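The distance categories used above (targeted, within 200 bp of a targeted feature, or further away) can be illustrated with a minimal sketch. It assumes the capture targets for one chromosome are supplied as sorted, non-overlapping, 1-based inclusive intervals; the function names and the example coordinates are hypothetical.

```python
from bisect import bisect_right


def distance_to_target(pos, targets):
    """Distance in bp from a SNP position to the nearest targeted base.

    targets: sorted, non-overlapping, 1-based inclusive (start, end)
    intervals on the same chromosome as the SNP (an assumed input format).
    """
    if not targets:
        raise ValueError("no target intervals supplied")
    starts = [start for start, _ in targets]
    i = bisect_right(starts, pos)  # number of intervals starting at or before pos
    best = None
    # Only the last interval starting at or before pos, and the next
    # interval after it, can contain or be nearest to pos.
    for j in (i - 1, i):
        if 0 <= j < len(targets):
            start, end = targets[j]
            if start <= pos <= end:
                return 0
            d = start - pos if pos < start else pos - end
            best = d if best is None else min(best, d)
    return best


def distance_category(distance):
    """Map a distance in bp to the categories used in Table 1."""
    if distance == 0:
        return "0 bp"
    if distance <= 200:
        return "1 to 200 bp"
    return "> 200 bp"


example_targets = [(1000, 1200), (5000, 5100)]
for snp_pos in (1100, 1350, 3000):
    d = distance_to_target(snp_pos, example_targets)
    print(snp_pos, d, distance_category(d))
```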

In total, we obtained a minimum of 117,158 and a 
maximum of 133,072 SNP genotypes from the four 
exomes. The array-based genotyping interrogated 
598,821 genotypes for A-7 and T-1 (Illumina Infinium
HumanHap610W-Quad BeadChip) and 731,306 genotypes
for M-3 and M-4 (Illumina OmniExpress BeadChip).
Table 2 compares the inter-marker distances
between exome genotypes for each sample to those for 
the genotyping array. The exome genotypes have much
more variable inter-marker distances than the genotyping
arrays, with a smaller median value.

Table 1 Number of HapMap Phase II SNPs covered ≥ 5 by distance to targeted base

Distance to      Number of SNPs (%)                                              HapMap
targeted base    M-3             M-4             A-7             T-1             Phase II (N)
0 bp             56,648 (91.9)   56,835 (92.2)   57,027 (92.5)   58,142 (94.3)   61,647
1 to 200 bp      50,077 (56.7)   50,805 (57.5)   46,144 (52.2)   57,923 (65.6)   88,349
> 200 bp         13,683 (0.4)    13,565 (0.4)    13,987 (0.4)    17,007 (0.5)    3,119,167
Total            120,408 (3.7)   121,205 (3.7)   117,158 (3.6)   133,072 (4.1)   3,269,163

The denominator for percentages is the total number of HapMap Phase II SNPs in that distance category.

Optimization of genotype concordance 

We inferred genotypes at the positions of SNPs located 
on the genotyping array used for each individual so that 
we could investigate genotype concordance between the 
two technologies. We found that ambiguous SNPs (A/T or
C/G SNPs) comprised a high proportion of SNPs with discordant
genotypes, despite being a small proportion of
SNPs overall. For example, for A-7 at coverage ≥ 5 and
t = 0.5 (see below), 77% (346 of 450) of discordant SNPs 
were ambiguous SNPs, while ambiguous SNPs com- 
posed just 2.7% of all SNPs (820 of 30,279). Such SNPs 
are prone to strand annotation errors, as the two alleles 
are the same on both strands of the SNP. We therefore 
discarded ambiguous SNPs, which left 29,459 to 52,892 
SNPs available for comparison (Table 3). 
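A minimal sketch of the ambiguous-SNP filter follows. An A/T or C/G SNP has alleles that are complements of each other, so the allele pair reads the same on either strand and a strand flip cannot be detected. The SNP identifiers and the function name are hypothetical.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def is_strand_ambiguous(allele1, allele2):
    """True for A/T and C/G SNPs, whose two alleles are complementary."""
    a1, a2 = allele1.upper(), allele2.upper()
    return COMPLEMENT.get(a1) == a2


# Hypothetical SNPs: identifier -> (allele 1, allele 2)
snps = {
    "snp_1": ("A", "T"),  # ambiguous
    "snp_2": ("A", "G"),
    "snp_3": ("G", "C"),  # ambiguous
    "snp_4": ("C", "T"),
}

# Keep only SNPs whose strand can be resolved from the alleles.
kept = {name: alleles for name, alleles in snps.items()
        if not is_strand_ambiguous(*alleles)}
print(sorted(kept))
```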

Several popular genotype-calling algorithms for MPS 
data require the prior probability of a heterozygous gen- 
otype to be specified [38,39]. We investigated the effect 
of varying this parameter, t, upon concordance of genotyping
array and WES genotypes (given WES coverage ≥ 5;
Table 3). Increasing this value from the default 0.001
results in a modest improvement in the percentage of 
WES genotypes being correctly classified, with most of 
the improvement occurring between t = 0.001 and t = 
0.05. The highest concordance is achieved at t = 0.5, 
where all four samples achieve 99.7% concordance, com- 
pared to 98.7 to 98.9% concordance at the default t = 
0.001. 
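To illustrate why the heterozygous prior matters at low coverage, the sketch below uses a deliberately simplified binomial genotype model. It is not the MAQ SNP model or the SAMtools implementation: it assumes a single fixed per-base error rate, ignores base and mapping qualities, and splits the remaining prior mass (1 - t) equally between the two homozygous genotypes. All names and the default error rate are illustrative assumptions.

```python
from math import comb


def genotype_posteriors(n_reads, n_alt, t, error=0.01):
    """Posterior genotype probabilities under a toy binomial model.

    n_reads: total reads covering the site
    n_alt:   reads carrying the non-reference allele
    t:       prior probability of a heterozygous genotype
    error:   assumed per-base error rate (illustrative value)
    """
    def binom(k, n, p):
        return comb(n, k) * p ** k * (1 - p) ** (n - k)

    priors = {"hom_ref": (1 - t) / 2, "het": t, "hom_alt": (1 - t) / 2}
    # Probability that a single read shows the non-reference allele.
    alt_prob = {"hom_ref": error, "het": 0.5, "hom_alt": 1 - error}
    joint = {g: priors[g] * binom(n_alt, n_reads, alt_prob[g]) for g in priors}
    total = sum(joint.values())
    return {g: value / total for g, value in joint.items()}


def call_genotype(n_reads, n_alt, t, error=0.01):
    """Return the genotype with the highest posterior probability."""
    posteriors = genotype_posteriors(n_reads, n_alt, t, error)
    return max(posteriors, key=posteriors.get)


for prior in (0.001, 0.5):
    print(prior, call_genotype(n_reads=6, n_alt=2, t=prior))
```

Under this toy model, a site with two non-reference reads out of six is called homozygous reference at t = 0.001 but heterozygous at t = 0.5, which is the kind of low-coverage site where a small prior would be expected to miss true heterozygotes at known SNPs.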



Table 2 Intermarker distances for the two genotyping
arrays and for exome genotypes covered ≥ 5

                        Median   1st quartile   3rd quartile
Illumina OmniExpress    2,233    814            5,125
Illumina 610            2,744    1,019          6,027
M-3                     1,853    236            11,390
M-4                     1,830    235            11,260
A-7                     1,943    240            12,000
T-1                     1,647    227            10,210

Intermarker distances are in base pairs.



We note that t = 0.5 may not be optimal for calling 
SNP genotypes on haploid chromosomes. At t = 0.5, the 
male M-4 had five X chromosome genotypes erroneously
called as heterozygous out of 1,026 (0.49%), while the
male T-1 had one such call out of 635 genotypes (0.16%).
The same SNPs were not called as heterozygous by the
genotyping arrays. No heterozygous X chromosome calls
were observed at the default value of t = 0.001. 

Linkage analysis and LOD score concordance 

Prior to performing linkage analysis on exome and array 
SNP genotypes, we selected one SNP per 0.3 cM to ensure 
linkage equilibrium while retaining a set of SNPs dense 
enough to effectively infer inheritance. The resulting sub- 
sets of WES genotypes (Table 4) contained 8,016 to 8,402 
SNPs with average heterozygosities of 0.40 or 0.41 among 
the CEPH HapMap genotypes, obtained from Utah resi- 
dents with ancestry from northern and western Europe 
(CEU). The resulting subsets of array genotypes (Table 4) 
contained more SNPs (12,173 to 12,243), with higher aver- 
age heterozygosities (0.48 or 0.49). 
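The marker selection step can be sketched as follows. This is an illustration rather than the selection logic of our script: it assumes fixed, non-overlapping 0.3 cM bins on each chromosome and approximates heterozygosity from a population allele frequency as 2p(1 - p), which presumes Hardy-Weinberg equilibrium. Fixed bins do not guarantee that adjacent selected SNPs are at least 0.3 cM apart. The record layout and names are hypothetical.

```python
def heterozygosity(allele_freq):
    """Expected heterozygosity 2p(1 - p) for a biallelic SNP (HWE assumed)."""
    return 2 * allele_freq * (1 - allele_freq)


def select_snps(snps, bin_width_cm=0.3):
    """Pick the most heterozygous SNP in each genetic-map bin.

    snps: iterable of (snp_id, chrom, cm_position, allele_freq) tuples.
    Returns the selected records ordered by chromosome and map position.
    """
    best = {}
    for record in snps:
        _snp_id, chrom, cm, freq = record
        key = (chrom, int(cm // bin_width_cm))
        het = heterozygosity(freq)
        if key not in best or het > best[key][1]:
            best[key] = (record, het)
    selected = [record for record, _ in best.values()]
    return sorted(selected, key=lambda r: (r[1], r[2]))


# Hypothetical markers: (id, chromosome, cM position, allele frequency)
example = [
    ("s1", "1", 0.05, 0.10),
    ("s2", "1", 0.20, 0.45),
    ("s3", "1", 0.35, 0.30),
    ("s4", "1", 0.50, 0.50),
    ("s5", "2", 0.10, 0.25),
]
print([record[0] for record in select_snps(example)])
```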

Despite this difference, there was good agreement 
between LOD scores achieved at linkage peaks using the 
different sets of genotypes (Figure 2, Table 5). The med- 
ian difference between the WES and array LOD scores 
across positions where either achieved the maximum 
score was close to zero for all three families (range 
-0.0003 to -0.002). The differences had a 95% empirical 
interval of (-0.572,0.092) for family A, with the other 
two families achieving narrower intervals (Table 5). 
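The comparison summarized above can be sketched as follows, assuming the WES and array LOD scores have been evaluated at the same ordered analysis positions. The percentile function uses simple linear interpolation, which is an assumption about how an empirical interval might be computed; the tolerance used to decide whether a score is at its maximum is also illustrative.

```python
import statistics


def percentile(sorted_values, q):
    """Linear-interpolation percentile (q in [0, 100]) of sorted values."""
    if len(sorted_values) == 1:
        return sorted_values[0]
    rank = (len(sorted_values) - 1) * q / 100
    lo = int(rank)
    hi = min(lo + 1, len(sorted_values) - 1)
    return sorted_values[lo] + (sorted_values[hi] - sorted_values[lo]) * (rank - lo)


def lod_difference_summary(wes_lod, array_lod, tol=1e-9):
    """Summarize WES - array LOD differences where either reaches its maximum."""
    if not wes_lod or len(wes_lod) != len(array_lod):
        raise ValueError("LOD vectors must be non-empty and of equal length")
    wes_max, array_max = max(wes_lod), max(array_lod)
    diffs = sorted(
        w - a
        for w, a in zip(wes_lod, array_lod)
        if w >= wes_max - tol or a >= array_max - tol
    )
    return {
        "n_positions": len(diffs),
        "median": statistics.median(diffs),
        "centile_2.5": percentile(diffs, 2.5),
        "centile_97.5": percentile(diffs, 97.5),
    }


# Hypothetical LOD scores at four analysis positions.
summary = lod_difference_summary([1.2, 1.2, 0.5, 1.1], [1.2, 1.1, 0.4, 1.2])
print(summary)
```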

Efficacy of filtering identified variants by location of 
linkage peaks 

If our genetic model is correct, then variants lying out- 
side of linkage peaks cannot be the causal mutation and 
can be discarded, thus reducing the number of candi- 
date disease-causing variants. Table 6 lists the number 
of nonsynonymous exonic variants (single nucleotide 
variants or indels) identified in each exome, as well as 
the number lying within linkage peaks identified using
WES genotypes. The percentage of variants eliminated 
depends upon the power of the pedigree being studied: 
81.2% of variants are eliminated for the dominant family 
M, which is not very powerful; 94.5% of variants are
eliminated for the recessive, consanguineous family A;
while 99.43% of variants are eliminated for the more distantly
consanguineous, recessive family T. Hence, linkage
analysis substantially reduces the fraction of variants
identified that are candidates for the disease-causing
variant of interest.

Table 3 Increasing the prior heterozygous probability modestly improves concordance between exome and array
genotypes

t                 M-3 (N = 52,617)   M-4 (N = 52,892)   A-7 (N = 29,459)   T-1 (N = 32,763)
0.00001           0.9737             0.9734             0.9698             0.9741
0.001 (default)   0.9882             0.9874             0.9865             0.9885
0.01              0.9927             0.9926             0.9918             0.9925
0.05              0.9951             0.9950             0.9942             0.9945
0.1               0.9958             0.9958             0.9950             0.9952
0.2               0.9968             0.9965             0.9958             0.9961
0.3               0.9971             0.9968             0.9961             0.9964
0.4               0.9973             0.9971             0.9964             0.9968
0.5               0.9974             0.9973             0.9965             0.9969

Proportion of SNPs where WES and genotyping array genotypes are concordant for the four exomes, for varying values of t (prior probability of a heterozygous
genotype). Conditional on coverage with ≥ 5 reads.

Conclusions 

Linkage analysis is of great potential benefit to WES stu- 
dies that aim to discover genetic variants resulting in 
Mendelian disorders. As variants outside of linkage 
peaks can be eliminated, it reduces the number of identified
variants that need to be investigated further. Linkage
analysis of WES genotypes allows information
regarding the location of the disease locus to be
extracted from WES data even if the causal variant is
not captured, suggesting regions of interest that may be 
targeted in follow-up studies. However, many such stu- 
dies are being published that employ less sophisticated 
substitutes for linkage analysis or do not consider 
inheritance information at all. Anecdotal evidence sug- 
gests that a substantial proportion of MPS studies of 
individuals with Mendelian disorders fail to identify a 
causal variant, though an exact number is not known 
due to publication bias. 

We describe how to extract HapMap Phase II SNP
genotypes from massively parallel sequencing data,
providing software to facilitate this process and generate
files ready to be analyzed by popular linkage programs.
Our method allows linkage analysis to be performed
without requiring genotyping arrays. The flexibility of
linkage analysis means that our method can be applied
to any disease model and a variety of sampling schemes,
unlike existing methods of considering inheritance information
for WES data. Linkage analysis incorporates
population allele frequencies and genetic map positions,
which allows superior identification of statistically unusual
sharing of haplotypes between affected individuals
in a family.

Table 4 Number and average heterozygosity of array and
WES SNPs selected for linkage analysis

                         M-3 and M-4          A-7                  T-1
                         WES       Array      WES       Array      WES       Array
SNPs available           114,681   677,144    117,158   593,638    133,071   587,680
SNPs selected            8,016     12,173     8,135     12,243     8,402     12,194
Average heterozygosity   0.40      0.49       0.40      0.48       0.41      0.48

Average heterozygosity refers to the HapMap CEU population and not to the
individual being studied. For M-3 and M-4, 'SNPs available' is the number of
SNPs covered ≥ 5 in both individuals.

We demonstrate linkage using WES genotypes for 
three small nuclear families - a dominant family from 
which two exomes were sequenced and two consangui- 
neous families from which a single exome was 
sequenced. As these families are not very powerful for 
linkage analysis, multiple linkage peaks with relatively 
low LOD scores were identified. Nonetheless, discarding 
variants outside of the linkage peaks eliminated between 
81.2% and 99.43% of all nonsynonymous exonic variants 
detected in these families. The number of variants 
remaining could be reduced further by applying stan- 
dard strategies, such as discarding known SNPs with 
minor allele frequencies above a certain threshold. Our 
work demonstrates the value of considering inheritance 
information, even in very small families that may con- 
sist, at the extreme, of a single inbred individual. As the 
price of exome sequencing falls, it will become feasible 
to sequence more individuals from each family, resulting 
in fewer linkage peaks with higher LOD scores. 

Exome capture using current technologies yields 
large numbers of useful SNPs for linkage mapping. 
Over half of all SNPs covered by five or more reads 
were not targeted by the exome capture platform. 
Approximately 78% of these captured untargeted SNPs 
lay within 200 bp of a targeted feature. This reflects 
the fact that fragment lengths typically exceed probe 
lengths, resulting in flanking sequences at both ends of
a probe or bait being captured and sequenced. The
serendipitous result is that a substantial number of
non-exonic SNPs become available, which can and
should be used for linkage analysis.

Smith et al. Genome Biology 2011, 12:R85
http://genomebiology.com/2011/12/9/R85

Figure 2 Genome-wide comparison of LOD scores using array-based and WES-derived genotypes for families A, T and M. Horizontal axes show location (cM).

We found that setting the prior probability of hetero- 
zygosity to 0.5 during genotype inference resulted in the 
best concordance between WES and array genotypes. 



The authors of the MAQ SNP model recommend using 
t = 0.2 for inferring genotypes at known SNPs [38], 
while the default value used to detect variants is t = 
0.001. Our results highlight the need to tailor this para- 
meter to the specific application, either genotyping or 
rare variant detection. Although we anticipated WES 
genotypes being less accurate than array genotypes, all
four samples achieved a high concordance of 99.7% for
SNPs covered by five or more reads at t = 0.5.

Table 5 Distribution of LOD score differences (WES - array) at linkage peaks

Family   Median    2.5th centile   97.5th centile
A        -0.0005   -0.572          0.092
T        -0.002    -0.390          0.035
M        -0.0003   -0.117          0.0034

Summary of differences at analysis positions where either the WES or the
array LOD scores reach their genome-wide maximum.

We found that LOD scores obtained from WES geno- 
types agreed well with those obtained from array geno- 
types from the same individual(s) at the location of 
linkage peaks, with the median difference in LOD score 
zero to two or three decimal places for all three families. 
This was despite the fact that the array-based genotype 
sets used for analysis contained more markers and had 
higher average heterozygosities than the corresponding 
WES genotype sets, reflecting the fact that genotyping 
arrays are designed to interrogate SNPs with relatively 
high minor allele frequencies that are relatively evenly 
spaced throughout the genome. By contrast, genotypes 
extracted from WES data tend to be clustered around 
exons, resulting in fewer and less heterozygous markers 
after pruning to achieve linkage equilibrium. We con- 
clude that if available, array-based genotypes from a 
high resolution SNP array are preferable to WES geno- 
types; but if not, linkage analysis of WES genotypes pro- 
duces acceptable results. 

Once WGS is more economical, we will be able to 
perform linkage analysis using genotypes extracted from 
WGS data, which will obviate the problem of gaps in 
SNP coverage outside of exons. The software tools we 
provide can accommodate WGS genotypes without 
requiring modification. In the future, initiatives such as 
the 1000 Genomes Project [1] may provide population- 
specific allele frequencies for SNPs not currently 
included in HapMap, further increasing the number of 
SNPs available for analyses, as well as the number of 
populations studied. 

The classic Lander-Green algorithm requires markers 
to be in linkage equilibrium [40]. Modeling linkage dis- 
equilibrium would allow incorporation of all markers 
without the need to select a subset of markers in
linkage equilibrium. This would allow linkage mapping
using distant relationships, such as distantly inbred 
individuals who would share a sub-linkage (< 1 cM) 
tract of DNA homozygous by descent. Methods that 
incorporate linkage disequilibrium have already been 
proposed, including a variable length HMM that can 
be applied to detect distantly related individuals [41]. 
Further work is being targeted towards approximations 
of distant relationships to connect sets of related pedi- 
grees [42]. These methods will extract the maximum 
information from MPS data from individuals with 
inherited diseases. 

We have integrated the relatively new field of MPS in 
families with classical linkage analysis. Where feasible, 
we strongly advocate the use of linkage mapping in 
combination with MPS studies that aim to discover var- 
iants causing Mendelian disorders. This approach does 
not require purpose-built HMMs, but can utilize exist- 
ing software implementations of the Lander-Green algo- 
rithm. Where genotyping array genotypes are not 
available, we recommend utilizing MPS data to their full 
capacity by using MPS genotypes to perform linkage 
analysis. This will reduce the number of candidate dis- 
ease-causing variants that need to be evaluated further. 
Should the causal variant not be identified by a WES 
study, linkage analysis will highlight regions of the gen- 
ome where targeted resequencing is most likely to iden- 
tify this variant. 

Materials and methods 

Informed consent, DNA extraction and array-based 
genotyping 

Written informed consent was provided by the four par- 
ticipants or their parents. Ethics approval was provided 
by the Royal Children's Hospital Research Ethics Com- 
mittee (HREC reference number 28097) in Melbourne. 
Genomic DNA was extracted from participants' blood 
samples using the Nucleon™ BACC Genomic DNA 
Extraction Kit (GE Healthcare, Little Chalfont, Buckin- 
ghamshire, England). 

All four individuals were genotyped using Illumina 
Infinium HumanHap610W-Quad BeadChip (A-7, T-1)
or OmniExpress (M-3, M-4) genotyping arrays (fee for
service, Australian Genome Research Facility, Melbourne,
Victoria, Australia). These arrays interrogate
598,821 and 731,306 SNPs respectively, with 342,956
markers in common. Genotype calls were generated
using version 6.3.0 of the GenCall algorithm implemented
in Illumina BeadStudio. A GenCall score cutoff (no-call
threshold) of 0.15 was used.

Table 6 Efficacy of variant elimination due to linkage peak filtering

Family  Model      Consanguinity                         Number of      Max   Number of nonsynonymous  Number (%) of nonsynonymous
                                                         linkage peaks  LOD   exonic variants          exonic variants in linkage regions
A       Recessive  First cousin offspring                15             1.2   10,982                   604 (5.50)
T       Recessive  First cousins once removed offspring  5              1.51  11,353                   65 (0.57)
M       Dominant   None                                  41             0.3   13,186                   2,478 (18.79)

Exome capture, sequencing and alignment 

Target DNA for the four individuals was captured using 
Illumina TruSeq, which is designed to capture a target 
region of 62,085,286 bp (2.00% of the genome), and 
sequenced using an Illumina HiSeq machine (fee for ser- 
vice, Axeq Technologies, Rockville, MD, United States). 
Individual T-1 was sequenced using one-quarter of a
flow cell lane while the other three individuals were 
sequenced using one-eighth of a lane. Paired-end reads 
of 110 bp were generated. 

Reads were aligned to UCSC hg19 using Novoalign
version 2.07.05 [43]. Quality score recalibration was per- 
formed during alignment, and reads that aligned to mul- 
tiple locations were discarded. Following alignment, 
presumed PCR duplicates were removed using MarkDuplicates.jar
from Picard [44]. Table S1 in Additional file 1
shows the number of reads at each stage of processing,
while Tables S2 and S3 in the same file show coverage
statistics for the four exomes.

WES genotype inference and linkage analysis 

SNP genotypes were inferred from WES data using the 
samtools mpileup and bcftools view commands from 
release 916 of the SAMtools package [45], which infers 
genotypes using a revised version of the MAQ SNP model 
[38]. We required base quality and mapping quality > 13. 
SAMtools produces a variant call format (VCF) file, from 
which we extracted genotypes using a Perl script. 
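The genotype extraction step can be illustrated with a short sketch. This is not our Perl script: it is a simplified Python illustration that assumes tab-delimited VCF data lines with a GT entry in the FORMAT column, skips indels and uncalled genotypes, and ignores any quality filtering. The function name and the example record (including its identifier) are hypothetical.

```python
def parse_vcf_genotype(line, sample_index=0):
    """Return (chrom, pos, allele_pair) for one VCF data line.

    Returns None for indels and uncalled genotypes. Assumes the FORMAT
    column (field 9) contains a GT entry.
    """
    fields = line.rstrip("\n").split("\t")
    chrom, pos, _snp_id, ref, alt = fields[:5]
    format_keys = fields[8].split(":")
    sample_values = fields[9 + sample_index].split(":")
    gt = sample_values[format_keys.index("GT")]

    # Allele index 0 is REF; indexes 1.. are the comma-separated ALT alleles.
    alleles = [ref] + (alt.split(",") if alt != "." else [])
    if any(len(allele) != 1 for allele in alleles):
        return None  # not a single nucleotide variant

    indexes = gt.replace("|", "/").split("/")
    if "." in indexes:
        return None  # uncalled genotype
    return chrom, int(pos), tuple(alleles[int(i)] for i in indexes)


# Hypothetical VCF data line for a heterozygous A/G genotype.
example_line = "chr1\t12345\tsnp_x\tA\tG\t50\t.\tDP=12\tGT:GQ\t0/1:45"
print(parse_vcf_genotype(example_line))
```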

These genotypes were formatted for linkage analysis 
using a modified version of the Perl script linkdatagen.pl 
[35] with an annotation file prepared for HapMap Phase II
SNPs. This script chose one SNP per 0.3 cM to be
used for analysis, with SNPs selected to maximize het- 
erozygosity according to CEU HapMap genotypes [34]. 
Array-based genotypes were prepared for linkage analy- 
sis in the same way, using annotation files for the appro- 
priate array. 

The two Perl scripts used to extract genotypes from 
VCF files and format them for linkage analysis are freely 
available on our website [46], as is the annotation file 
for HapMap Phase II SNPs. Users may also download 
VCF files containing WES SNP genotypes for the four 
individuals described here (both for HapMap Phase II 
and genotyping array SNPs), as well as files containing 
genotyping array genotypes for comparison. 

Multipoint parametric linkage analysis using WES and 
array genotypes was performed using MERLIN [47]. A 
population disease allele frequency of 0.00001 was
specified, along with a fully penetrant recessive (family
A, family T) or dominant (family M) genetic model. 
LOD scores were estimated at positions spaced 0.3 cM 
apart, and CEU allele frequencies were used. 

WES variant detection 

SAMtools mpileup/bcftools was also used to detect variants
from the reference sequence with the default setting
of t = 0.001. Variants were annotated by
ANNOVAR [48] using the UCSC Known Gene annota- 
tion. For the purposes of filtering variants, linkage peaks 
were defined as the intervals in which the genome-wide 
maximum LOD score was obtained, plus 0.3 cM on 
either side. 
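The peak definition and filtering step can be sketched as follows. The sketch assumes LOD scores at ordered analysis positions and variants that have already been assigned genetic map positions in cM (for example by interpolation, which is not shown). The tolerance for deciding that a score equals the maximum and all names are illustrative assumptions.

```python
def linkage_peaks(positions, lod_scores, pad_cm=0.3, tol=1e-6):
    """Intervals where the LOD score is at its genome-wide maximum.

    positions:  list of (chrom, cm) tuples, ordered within each chromosome
    lod_scores: LOD score at each position
    Returns (chrom, start_cm, end_cm) tuples, padded by pad_cm on each side.
    """
    max_lod = max(lod_scores)
    peaks = []
    current = None  # [chrom, start_cm, end_cm] of the run being extended
    for (chrom, cm), lod in zip(positions, lod_scores):
        at_max = lod >= max_lod - tol
        # Close the open run if it ends here or a new chromosome starts.
        if current is not None and (not at_max or current[0] != chrom):
            peaks.append(tuple(current))
            current = None
        if at_max:
            if current is None:
                current = [chrom, cm, cm]
            else:
                current[2] = cm
    if current is not None:
        peaks.append(tuple(current))
    return [(chrom, start - pad_cm, end + pad_cm) for chrom, start, end in peaks]


def in_peak(chrom, cm, peaks):
    """True if a variant at (chrom, cm) lies within any linkage peak."""
    return any(c == chrom and start <= cm <= end for c, start, end in peaks)


# Hypothetical LOD scores at 0.3 cM spacing on two chromosomes.
positions = [("1", 0.0), ("1", 0.3), ("1", 0.6), ("1", 0.9), ("2", 0.0), ("2", 0.3)]
lods = [0.1, 1.2, 1.2, 0.2, 1.2, -2.0]
peaks = linkage_peaks(positions, lods)
variants = [("1", 0.5), ("1", 1.5), ("2", 0.2), ("3", 0.1)]
retained = [v for v in variants if in_peak(v[0], v[1], peaks)]
print(peaks, retained)
```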

Additional material 

Additional file 1: Supplementary tables. 